SUPERIORITY OF FIT
1. INTRODUCTION

Do discrepancies between observations and predictions indicate
true population differences? A statistical test, which in the
Neyman-Pearson formulation [4] can answer this question either
yes (with the risk of a Type I error) or no (with the risk of a
Type II error), can also, in the Fisher formulation [1, Chapter 2],
fail to answer the question (an insignificant result) or, answering
it, answer it only in the affirmative (a significant result).

Neyman [3] reviews the controversy between these two opposing
formulations. Though the tendency over the years has been
increasingly to adopt the Neyman-Pearson formulation in both
textbooks and research reports, the practice in specific
subject-matter areas has not always been consistent. In psychology,
for example, while textbooks typically present the Neyman-Pearson
formulation, research reports continue to reflect the influence of
Fisher in such statements as "The result is significant (p < .05)"
or "The result is not significant (p > .05)," where p indicates the
probability of obtaining the result (or a more extreme result)
through sampling error alone. The purpose here, however, is not to
evaluate either formulation, especially relative to the other, but
rather to present a hybrid formulation applicable particularly to
the evaluation of numerical predictions.

2. A HYBRID FISHER - NEYMAN-PEARSON FORMULATION

This formulation rests on a widespread belief among philosophers
of science [e.g., 2, Chapter 4, particularly p. 78] that no
numerical prediction is precisely accurate. Testing the accuracy
of a single numerical prediction thus makes no sense: Use of a
large enough sample will always lead to the rejection of the
prediction as inaccurate. To rule out tests of goodness of fit,
however, is not to rule out tests of superiority of fit. Testing
the relative accuracy of two different numerical predictions on
the same set of observations does make sense. The null hypothesis
($H_0$) of such a test (the hypothesis to be nullified, in Fisher's
terminology) is simply that the two predictions are equally
inaccurate. Equal inaccuracy implies that the population value is
midway between the two predictions so that their mean is itself a
precisely accurate prediction. Since no numerical prediction is
precisely accurate, the null hypothesis of equal inaccuracy must
thus be false.

If this hypothesis is false, however, then one of the two
predictions must be more accurate than the other. Deciding that
one prediction is more accurate than the other when the reverse
is true is thus an all-inclusive error having unconditional, or
total, probability

$$\alpha_T = \alpha_1 P_1 + \alpha_2 P_2, \tag{2.1}$$

where $\alpha_i$ (i = 1, 2) is the conditional probability of
incorrectly deciding that prediction i is less accurate and $P_i$
(i = 1, 2) is the (prior) probability that prediction i is in fact
more accurate. Fairness to both predictions requires that
$\alpha_1 = \alpha_2 = \alpha$ so that

$$\alpha_T = \alpha (P_1 + P_2). \tag{2.2}$$

If equal inaccuracy is impossible, $P_1 + P_2 = 1$, and thus
$\alpha_T = \alpha$: The total probability of error is equal to
either one of the two equal conditional probabilities of error.
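
The arithmetic behind equations (2.1) and (2.2) can be checked with
a minimal sketch. The function name `total_error` and the prior
values tried below are illustrative assumptions, not part of the
original formulation.

```python
def total_error(alpha_1, alpha_2, p_1, p_2):
    """Total probability of error, as in equation (2.1)."""
    return alpha_1 * p_1 + alpha_2 * p_2


# Equal conditional error rates (alpha_1 = alpha_2 = alpha) and priors
# that sum to 1: the total error should equal alpha whatever the priors.
alpha = 0.05
for p_1 in (0.1, 0.5, 0.9):  # hypothetical prior probabilities
    p_2 = 1.0 - p_1
    print(p_1, round(total_error(alpha, alpha, p_1, p_2), 10))
```

Whatever prior is assumed, the printed total stays at the common
conditional value, which is why the priors need not be known.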

This formulation thus resembles Fisher's in its exclusion of
the acceptability of $H_0$ and Neyman-Pearson's in its inclusion of
the probability of error.

3. TESTING SUPERIORITY OF FIT

Application in the form of a statistical test requires
specification of the null hypothesis and sequential data collection
until the rejection of this hypothesis occurs.

Since the null value ($\theta$) is midway between the two predicted
values ($\theta_1$ and $\theta_2$), the null hypothesis is
$H_0$: $\theta = (\theta_1 + \theta_2)/2$. The equality sign in this
hypothesis shows that the test is two-tailed. Rejection of $H_0$
occurs when the test statistic falls in either tail of the sampling
distribution that the test statistic would have if $H_0$ were true.
The decision that follows, that one or the other prediction is more
accurate, depends on which tail this is. Either decision has a
probability of error equal to $\alpha$, the area under each tail,
which is also the total probability of error.

Sequential data collection is necessary to avoid the acceptance
of $H_0$, which, according to the belief that no prediction is
precisely accurate, is impossible. Sampling thus proceeds one
sampling unit at a time. Computation of the test statistic (or an
equivalent value) T follows the sampling of each sampling unit along
with the determination (if necessary) of appropriate critical
values, $t_1$ and $t_2$ ($t_1 < t_2$). The decision depends on the
relationship between T and $t_1$ and $t_2$: If $T < t_1$, the
decision is that prediction 1 is more accurate than prediction 2;
if $T > t_2$, the decision is that prediction 2 is more accurate
than prediction 1. If $t_1 < T < t_2$, however, the decision is to
continue sampling.
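
The decision rule just described can be sketched as a short loop.
This is only an outline under stated assumptions: the names
`sequential_test`, `statistic`, and `critical_values` are
hypothetical, and the toy statistic and critical values in the usage
lines are made up purely to exercise the loop.

```python
def sequential_test(observations, statistic, critical_values):
    """Sample one unit at a time until T leaves the interval (t1, t2).

    statistic(sample)   -> T for the units sampled so far (assumed helper)
    critical_values(n)  -> (t1, t2) with t1 < t2 for sample size n
    Returns (decision, n): decision is 1 or 2 for the prediction judged
    more accurate, or None if the supplied data run out first.
    """
    sample = []
    for x in observations:
        sample.append(x)
        t_stat = statistic(sample)
        t1, t2 = critical_values(len(sample))
        if t_stat < t1:
            return 1, len(sample)
        if t_stat > t2:
            return 2, len(sample)
        # Otherwise t1 <= T <= t2: continue sampling.
    return None, len(sample)


# Toy usage with invented numbers: T is the running mean and the
# critical values are fixed at 0.45 and 0.60 for every sample size.
result = sequential_test(
    [0.5, 0.55, 0.7, 0.9],
    statistic=lambda s: sum(s) / len(s),
    critical_values=lambda n: (0.45, 0.60),
)
print(result)  # (2, 4)
```

In the procedure described in the text, sampling would simply go on
rather than stop with no decision; the `None` return only reflects
that a finite list was supplied here.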

4. AN ILLUSTRATION OF THE METHOD

On the first day of school, an instructor of a large class
administers a preliminary examination consisting of many items,
each scorable as correct or incorrect. Automatic scoring
immediately following the examination shows that the mean
proportion of items answered correctly is $\pi$. Knowing $\pi$, the
instructor then asks a student selected randomly from the class a
question selected randomly from the examination in a process that
continues, if necessary, for a parallel succession of students and
questions until the answer received is correct. Recording the
number (X) of answers received prior to the correct answer, the
instructor repeats the process for trial after trial in order to
determine whether a geometric distribution with parameter $\pi$ or
a Poisson distribution with parameter $(1 - \pi)/\pi$ more
accurately predicts the variance of X.

Each of these distributions predicts the same mean:
$\mu = (1 - \pi)/\pi$. If the variance of X is $\sigma_1^2$ for the
Poisson and $\sigma_2^2$ for the geometric distribution, the null
hypothesis is $H_0$: $\sigma^2 = (\sigma_1^2 + \sigma_2^2)/2$, where

$$\sigma_1^2 = (1 - \pi)/\pi \tag{4.1}$$

and

$$\sigma_2^2 = (1 - \pi)/\pi^2. \tag{4.2}$$
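
As a worked check of the mean, of equations (4.1) and (4.2), and of
the null value, the following sketch evaluates them for a proportion
of .2. The function name `predictions` is an illustrative assumption;
exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction


def predictions(pi):
    """Return (mean, Poisson variance, geometric variance, null variance)."""
    mean = (1 - pi) / pi                # common predicted mean
    var_poisson = (1 - pi) / pi         # equation (4.1)
    var_geometric = (1 - pi) / pi ** 2  # equation (4.2)
    var_null = (var_poisson + var_geometric) / 2
    return mean, var_poisson, var_geometric, var_null


mean, var_poisson, var_geometric, var_null = predictions(Fraction(1, 5))
print(mean, var_poisson, var_geometric, var_null)  # 4 4 20 12
```

The two variance predictions are thus 4 and 20, and the null
hypothesis places the population variance midway between them at 12.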
The test statistic is chi square divided by its degrees of freedom:

$$T = \sum (X - \mu)^2 / N\sigma^2, \tag{4.3}$$

which ($\mu$ being known) has N degrees of freedom after the
sampling of N sampling units. Using $\mu$ and $\sigma^2$ as
constants, an electronic calculator determines T for each pair of X
and N values, and the instructor plots these values on a graph on
which two lines join the critical values $t_1$ and $t_2$ for
successive values of N. Figure A shows the results for $\pi$ = .2
and $\alpha$ = .05. After five trials, yielding the succession of X
values 2, 5, 4, 6, and 4, the instructor rejects $H_0$, deciding
with a probability of error equal to .05 that the Poisson
distribution predicts the variance more accurately than the
geometric distribution.
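
The illustration can be retraced with a short sketch. It assumes
that the critical values are the lower and upper .05 quantiles of
chi square with N degrees of freedom, each divided by N; the text
does not say how the lines in Figure A were computed, so this is an
assumption. The helper names (`chi2_cdf`, `chi2_quantile`,
`superiority_test`) are hypothetical, and the pure-Python chi-square
routines stand in for whatever table or library would normally
supply the quantiles.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def superiority_test(xs, mu, sigma2, alpha):
    """Sequential test of the variance; returns (decision, N, T)."""
    sum_sq = 0.0
    t_stat = None
    for n, x in enumerate(xs, start=1):
        sum_sq += (x - mu) ** 2
        t_stat = sum_sq / (n * sigma2)           # equation (4.3)
        t1 = chi2_quantile(alpha, n) / n         # lower critical value
        t2 = chi2_quantile(1.0 - alpha, n) / n   # upper critical value
        if t_stat < t1:
            return "Poisson", n, t_stat          # smaller variance (4)
        if t_stat > t2:
            return "geometric", n, t_stat        # larger variance (20)
    return "continue", len(xs), t_stat


decision, n, t_stat = superiority_test([2, 5, 4, 6, 4], 4.0, 12.0, 0.05)
print(decision, n, round(t_stat, 2))  # Poisson 5 0.15
```

Under these assumptions the sketch stops at the fifth observation in
the lower tail, in agreement with the decision reported in the text.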




5. DISCUSSION OF THE ILLUSTRATION

Since the succession of incorrect and correct answers constitutes
a Bernoulli process, the distribution of X ought to be geometric,
not Poisson. The result, however, is not one of the five errors
that can occur in every one hundred repetitions of the test. The
illustration is fictitious. The product of simulation, the five
observations tend in fact to follow a Poisson distribution with
parameter equal to 4. The histogram in Figure B describes this
distribution. As a general rule, for samples as small as five, the
probability of error approximates its nominal value to the extent
that the observations follow a normal distribution. Since the
histogram in Figure B tends to be unimodal and symmetrical like a
normal curve, therefore, the total probability of error ought to be
close to its nominal value of .05 despite the small sample size.

Such early rejection of $H_0$ in a sequential test is ordinarily
not so defensible. If the sampling distribution of X were closer
to the geometric than the Poisson, for example, the histogram in
Figure B might be skewed sufficiently to have a substantial effect
on the total probability of error, particularly for low N. Since
five observations were necessary to reject $H_0$ in favor of a
Poisson distribution even when the distribution of the observations
was in fact Poisson, however, a sample considerably larger than five
would likely be necessary to reject $H_0$ inappropriately in favor
of a Poisson distribution. The requirement of large samples for the
occurrence of error keeps the test honest. Regardless of the form
of the distribution of observations, the probability of error
generally tends more and more closely to approximate its nominal
value as samples increase in size.

Statistics other than the variance tested here are, of course,
also possible targets of inference in tests of superiority of fit.
The number of sampling units required, however, may depend on the
statistic chosen for testing. This number generally ought to be
smaller for predictions that are far apart than for predictions that
are close together. The two distributions compared here thus allowed
no choice: The predictions of the mean were equal, and the
predictions of the variance were particularly far apart (4 versus
20).